spades.py -1 /data/lab8/illumina_reads_R1.fastq.gz -2 /data/lab8/illumina_reads_R2.fastq --pacbio /data/lab8/pacbio_reads.fastq --careful --cov-cutoff auto -o spades_assembly_all_illumina -t 1
left on screen starting 10/11/18 5:10pm PST
output in cd ~/fa18-BioE131/lab08/spades_assembly_all_illumina
Question:
Why do we expect short reads to produce a more fragmented assembly than long reads?
In general, more short reads are needed to produce the same coverage as a long reads data. The contigs are shorter and there're many contigs, also there may be regions that are not being covered in the reads due to the short read length. For example, the repetitive regions may be considered as small fragments because the read length is unable to detect the larger patterns. All of the issues above lead to a more fragmented they long reads. Also, if the sequence is de novo and so a reference doesn't exist, repeated areas can cause a lot of difficulty in sequence assembly. Additional difficulties include base substitutions by inaccurate polymerases, chimeric sequences, and PCR-bias, all of which can contribute to generating an incorrect sequence.
Why does a single-molecule sequencing like PacBio have a higher error rate than Illumina?
As PacBio requencing differs significantly from the Illumina sequencing in that there is no need for amplification during the library preparation, nor during the sequencing process, and it apply a movie capture of the polymerase activity, the signal to noise ratio is larger and so the error rate is also higher, as both mismatches and insertion proportion is larger than in a illunina sequencing.
Code Line: output a tabular analysis with headers
assembly-stats -t ./spades_assembly_all_illumina/scaffolds.fasta ./spades_assembly_all_illumina/contigs.fasta > ./assembly_stats_report.txt
import pandas as pd
data = pd.read_csv('assembly_stats_report.csv',sep = " ")
data
df1 = data[['filename','total_length',"number","N50"]]
df1
N50 is useful in that if evaluates the assembly quality in terms of contiguity, as it's defined as the senquence length of the shortest contig at 50% of the total genome length. It is useful to generate the distributive analysis and a half cut in terms of contig.
The mean and median contig length does not take into account of the contiguity and the total length of the read length being assemblied. In that way, the mean and median is not a even cut point in terms of the number of bases
Scaffolds are longer than contigs, around 50% longer N50 and much longer in terms of longest and mean segments
import pandas as pd
data = pd.read_csv('contig_cov.csv',sep = " ", header = None,names = ["name", "length","Coverage"])
# A demonstration of the first five lines
print(data[1:5])
# extract the second column as the coverage
cov = data[["Coverage"]]
cov.hist(bins=149)
# removing the single extremely high anomaly (multiple G and multiple T)for a closer look
cov2 = data[0:147][["Coverage"]]
cov2.hist(bins=140)
The coverage is not uniformly distributed, and has a closest look of bimodal
mean coverage = total mapped reads time/total reference length, and so the contig coverage = K * total mapped reads time/total reference length. A possible explanation is that the contig length is a 1/k of the total length, where k is a positive integer larger than 1, also the read time will be a 1/k' of the total read times. If both two condition is satisfied, a contig converage could be a multiple of the mean coverage.
Terminal Code:
rna_hmm3.py -i ./spades_assembly_all_illumina/contigs.fasta -o ./ran_hmm_out.txt
Inspect the output
import pandas as pd
data = pd.read_csv('ran_hmm_out.txt',sep = " ")
data
import pandas as pd
data = pd.read_csv('ran_hmm_out.GFF',sep = " ")
data
bedtools getfasta -fi ./spades_assembly_all_illumina/contigs.fasta -bed ./ran_hmm_out.GFF -fo 16s_rRNA.fastaThe output is in fasta format named 16s_rRNA.fasta. For inspection:
NODE_13_length_195508_cov_8.764542:48-1592 CGATTAAGGAGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTTACGACTTCACCCCAGTCATGAATCACACCGTGGTAACCGTCCTCCCGAAGGTTAGACTAGCTACTTCTGGTGCAACCCACTCCCATGGTGTGACGGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCGCGACATTCTGATTCGCGATTACTAGCGATTCCGACTTCACGCAGTCGAGTTGCAGACTGCGATCCGGACTACGATCGGTTTTCTGGGATTAGCTCCACCTCGCGGCTTGGCAACCCTCTGTACCGACCATTGTAGCACGTGTGTAGCCCAGGCCGTAAGGGCCATGATGACTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCCTTAGAGTGCCCACCATTACGTGCTGGTAACTAAGGACAAGGGTTGCGCTCGTTACGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCTCAATGTTCCCGAAGGCACCAATCCATCTCTGGAAAGTTCATTGGATGTCAAGGCCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCAACTTAATGCGTTAGCTGCGCCACTAAGAGCTCAAGGCTCCCAACGGCTAGTTGACATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTCAGTGTCAGTATCAGTCCAGGTGGTCGCCTTCGCCACTGGTGTTCCTTCCTATATCTACGCATTTCACCGCTACACAGGAAATTCCACCACCCTCTACCATACTCTAGCTCGTCAGTTTTGAATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCCAACTTAACGAACCACCTACGCGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCTGTATTACCGCGGCTGCTGGCACAGAGTTAGCCGGTGCTTATTCTGTCGGTAACGTCAAAACAATTACGTATTAGGTAACTGCCCTTCCTCCCAACTTAAAGTGCTTTACAATCCGAAGACCTTCTTCACACACGCGGCATGGCTGGATCAGGCTTTCGCCCATTGTCCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCAGTTCCAGTGTGACTGATCATCCTCTCAGACCAGTTACGGATCGTCGCCTTGGTGAGCCATTACCTCACCAACTAGCTAATCCGACCTAGGCTCATCTGATAGCGCAAGGCCCGAAGGTCCCCTGCTTTCTCCCGTAGGACGTATGCGGTATTAGCGTCCGTTTCCGAGCGTTATCCCCCACTACCAGGCAGATTCCTAGGCATTACTCACCCGTCCGCCGCTCTCAAGAGGTGCAAGCACCTCTCTACCGCTCGACTTGCATGTGTTAGGCCTGCCGCCAGCGTTCAATCTGAGCCATGATCAAACTCTTCAGTTCAAA NODE_2_length_468307_cov_8.954830:159049-160594 TTTGAACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGGTGCTTGCACCTCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAACGCTCGGAAACGGACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTTACCTAATACGTAATTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCAACTAGCCGTTGGGAGCCTTGAGCTCTTAGTGGCGCAGCTAACGCATTAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATCCAATGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACATTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTAATGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAGAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAGAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCACCAGAAGTAGCTAGTCTAACCTTCGGGAGGACGGTTACCACGGTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATCACCTCCTTAATCGA NODE_5_length_352739_cov_8.419214:68567-70112 TCGATTAAGGAGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTTACGACTTCACCCCAGTCATGAATCACACCGTGGTAACCGTCCTCCCGAAGGTTAGACTAGCTACTTCTGGTGCAACCCACTCCCATGGTGTGACGGGCGGTGTGTACAAGGCCCGGGAACGTATTCACCGCGACATTCTGATTCGCGATTACTAGCGATTCCGACTTCACGCAGTCGAGTTGCAGACTGCGATCCGGACTACGATCGGTTTTCTGGGATTAGCTCCACCTCGCGGCTTGGCAACCCTCTGTACCGACCATTGTAGCACGTGTGTAGCCCAGGCCGTAAGGGCCATGATGACTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCCTTAGAGTGCCCACCATTACGTGCTGGTAACTAAGGACAAGGGTTGCGCTCGTTACGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCTCAATGTTCCCGAAGGCACCAATCCATCTCTGGAAAGTTCATTGGATGTCAAGGCCTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGCCCCCGTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCAACTTAATGCGTTAGCTGCGCCACTAAGAGCTCAAGGCTCCCAACGGCTAGTTGACATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTCAGTGTCAGTATCAGTCCAGGTGGTCGCCTTCGCCACTGGTGTTCCTTCCTATATCTACGCATTTCACCGCTACACAGGAAATTCCACCACCCTCTACCATACTCTAGCTCGTCAGTTTTGAATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCCAACTTAACGAACCACCTACGCGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCTGTATTACCGCGGCTGCTGGCACAGAGTTAGCCGGTGCTTATTCTGTCGGTAACGTCAAAACAATTACGTATTAGGTAACTGCCCTTCCTCCCAACTTAAAGTGCTTTACAATCCGAAGACCTTCTTCACACACGCGGCATGGCTGGATCAGGCTTTCGCCCATTGTCCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCAGTTCCAGTGTGACTGATCATCCTCTCAGACCAGTTACGGATCGTCGCCTTGGTGAGCCATTACCTCACCAACTAGCTAATCCGACCTAGGCTCATCTGATAGCGCAAGGCCCGAAGGTCCCCTGCTTTCTCCCGTAGGACGTATGCGGTATTAGCGTCCGTTTCCGAGCGTTATCCCCCACTACCAGGCAGATTCCTAGGCATTACTCACCCGTCCGCCGCTCTCAAGAGGTGCAAGCACCTCTCTACCGCTCGACTTGCATGTGTTAGGCCTGCCGCCAGCGTTCAATCTGAGCCATGATCAAACTCTTCAGTTCAAA NODE_32_length_47877_cov_11.378513:46285-47829 TTTGAACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGGTGCTTGCACCTCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAACGCTCGGAAACGGACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTTGGGCCTTGGGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTTACCTAATACGTAATTGTTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCAACTAGCCGTTGGGAGCCTTGAGCTCTTAGTGGCGCAGCTAACGCATTAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATCCAATGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACATTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTAATGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAGAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAGAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCACCAGAAGTAGCTAGTCTAACCTTCGGGAGGACGGTTACCACGGTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATCACCTCCTTAATCG NODE_21_length_103352_cov_10.011827:35530-37075 TTTGAACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGGTGCTTGCACCTCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAACGCTCGGAAACGGACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCGTAACTGGTCTGAGAGGGTGATCAGTCAGATTTGAACTGGGACATGGTCAAGACTGCTAGGGGAGGCAGCAGTGGGGAATATTGGAGAATGGGCGAAAGCCTGATCCAGCCATGCAGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGGACTTTAAGTTGGGGGGAATGGCAGTTACCTAATACGTAATTGTGTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCAACTAGCCGTTGGGAGCCTTGAGCTCTTAGTGGCGCAGCTAACGCATTAAGTTGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGCCTTGACATCCAATGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACATTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACGTAATGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCTGGGCTACACACGTGCTACAATGGTCGGTACAGAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCAGAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCACCAGAAGTAGCTAGTCTAACCTTCGGGAGGACGGTTACCACGGTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATCACCTCCTTAATCGA
Pseudomonas, from the hierarchy view of the result, all five fasta rRNA strains are computed as belong to the Pseudomonas genus, and thus it has a strong confidence to be a pseudomonas. Moreover, from the output seqmatch txt output, we can see that despite a variation between the matched species, all the matching result with high confidence are in the pseudomonas genus.
upload genome to two remote annotation services that will perform the annotation automatically:
from IPython.display import Image, display
img6 = Image(filename = "RAST_table.png")
img6
from IPython.display import Image, display
img7 = Image(filename = "RAST_graph.png")
img7
from IPython.display import Image, display
img8 = Image(filename = "RAST_output.png")
img8
from IPython.display import Image, display
img1 = Image(filename = "BASys_output.png")
img1
from IPython.display import Image, display
img2 = Image(filename = "BASys_summary.png")
img2
from IPython.display import Image, display
img5 = Image(filename = "BASys_table.png")
img5
from IPython.display import Image, display
img3 = Image(filename = "DFAST_output.png")
img3
from IPython.display import Image, display
img4 = Image(filename = "DFAST_table.png")
img4
After the server run was done, a close inspection on the table of annotated features is done, and the following traits are selected for the genome analysis write-up:
From the genome annotation we gather, there are multiple type IV pilus assembly and biogenesis proteins including TadE, PilZ, and PilB. Thus, it is clear that the genus has adapted to live in aqueous environment. Moreover, as our annotation did not indicate the characteristics of ocean strains, which have much more sensors than freshwater strains, the relatives are of higher possibility to live in a freshwater or land region.
There's no auxotrophy related genes being annotated in all three servers' outputs, thus our strain is of high possibility to produce all the essential amino acids by itself. Similarly, CRISPR cas enzymes and short tandem repeats were not enriched out from the annotation. Despite a previous report of the type 2 CRISPR system in pseudomonas, our strain does not seem to come up with the CRISPR system from the annotation result. There are no virulence or bacteriophage related horizontal transfer being observed from the feature table. However, toxin related genes including HicA and HipA are observeed.
From literature, HicA, a mRNA interferase toxin, is originated in E.Coli species and so there's a horizontal transfer of this toxin gene from another species to our strain [1]. Moreover, multiple antibiotic production and resistance genes are present, suggesting that this Pseudomonas strain is capable of antibiotic production and antibiotic resistance. Specifically, from pFam MbtH,antibiotic synthesis monooxygenase and type 2 antibiotic synthesis protein are responsible for producing antibiotics. There are a number of resistance related proteins produced by the strain; resistance-nodulation-cell division (RND) efflux membrane fusion protein and Bcr/CflA family drug resistance efflux transporter is related to the Efflux-mediated multiresistance in Gram-negative bacteria, and the strain has specific genes responsible for the resistance towards various antibiotin in plants related to fusaric acid, azaleucine, acriflavine and bleomycin. Finally, the presence of beta-lactamse,penicillin amidase, and cephalosporin hydroxylase also functions as potential antibiotic-degrading enzymes, so that the strain may be resistant to penicillin and cephalosporin.
Based on the phenotype analysis above, the bacteria is capable for binding to human tissue via pilus and form biofilm by Biofilm PGA synthesis auxiliary protein PgaD, can secrete toxic components and infect the mammalian tissue,and is also hard to treat due to the various display of resistance to antibiotics. The research also supports that biofilms of Pseudomonas can cause chronic opportunistic infections and cannot be treated effectively with traditional antibiotic therapy [2]. Thus, both genome annotation and secondary research lead to the consistent conclusion that the strain may be human pathogen.